Doing Data Science (ch 1 and 2)
The human face of big data: Rick Smolan
Introduction to Information Retrieval - Ch 13
In [1]:
# Use this for interactive plots
%matplotlib notebook
import matplotlib.pyplot as plt
import pandas as pd
pd.Series([1,2,3,4]).plot()
When asking the professor a question, use the STAR approach: Situation, Task, Action, Result.
You must participate in the online Google group to get full participation credit.
Data modeling pipeline
Bias-variance tradeoff
Squared bias = square of the amount by which the expected model prediction (averaged over all training sets) differs from the true value
Variance = expected squared amount by which the prediction from one training set differs from the expected prediction over all training sets
$y = f(x) + \epsilon$, where $\epsilon$ is noise with mean 0 and variance $\sigma^2$
Formula of bias = $E[h(x^*)]-f(x^*)$
Formula of variance = $E[(h(x^*)-E[h(x^*)])^2]$
Formula of irreducible noise = $E[(y-f(x^*))^2] = E[\epsilon^2] = \sigma^2$
$h(x^*)$ is the model's prediction
$f(x^*)$ is the true value of the function
$y$ is the actual value
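The three terms above can be estimated by simulation. The following is a minimal sketch, not part of the original notes: the true function `f`, the noise level `SIGMA`, the test point `X_STAR`, and the choice of a straight-line model for $h$ are all assumptions made for illustration. It fits the line on many simulated training sets and checks that the average squared error at $x^*$ is roughly squared bias + variance + $\sigma^2$.

```python
import random
import statistics

SIGMA = 0.5      # noise standard deviation (assumed)
X_STAR = 0.8     # test point x* (assumed)
N_TRAIN = 20     # points per training set (assumed)
N_SETS = 2000    # number of simulated training sets (assumed)

def f(x):
    """True function; chosen only for this illustration."""
    return x * x

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def simulate(seed=0):
    """Return (squared bias, variance, noise, average squared error) at X_STAR."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(N_SETS):
        # Draw a fresh training set and fit the model h on it
        xs = [rng.uniform(-1, 1) for _ in range(N_TRAIN)]
        ys = [f(x) + rng.gauss(0, SIGMA) for x in xs]
        a, b = fit_line(xs, ys)
        h = a + b * X_STAR
        preds.append(h)
        # Compare against a fresh noisy observation y at x*
        y_star = f(X_STAR) + rng.gauss(0, SIGMA)
        sq_errors.append((y_star - h) ** 2)
    mean_pred = statistics.fmean(preds)
    bias_sq = (mean_pred - f(X_STAR)) ** 2
    variance = statistics.fmean([(p - mean_pred) ** 2 for p in preds])
    return bias_sq, variance, SIGMA ** 2, statistics.fmean(sq_errors)

bias_sq, variance, noise, mse = simulate()
print(f"bias^2={bias_sq:.3f} variance={variance:.3f} noise={noise:.3f}")
print(f"sum={bias_sq + variance + noise:.3f} vs. average squared error={mse:.3f}")
```

Because the straight line cannot follow the curved `f`, the squared bias stays well above zero however many training sets are drawn; a more flexible model would typically trade some of that bias for higher variance.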
Standard procedure for calculating:
Before starting anything, do a back-of-the-envelope calculation.
Good rules of thumb: reading 1 TB takes about 3 hours on a single machine, and a computer fails on average after about 1000 days.
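As an illustration, the two rules of thumb can be turned into a quick estimate. This is only a sketch: the function names are made up for this example, and the 1000 TB dataset and 1000-machine cluster are hypothetical numbers, not figures from the notes.

```python
TB_READ_HOURS = 3    # from the notes: reading 1 TB takes about 3 hours on one machine
MTTF_DAYS = 1000     # from the notes: a machine fails on average after about 1000 days

def read_time_hours(size_tb, machines=1):
    """Hours to read size_tb terabytes when the data is split evenly across machines."""
    return size_tb * TB_READ_HOURS / machines

def throughput_mb_per_s():
    """Read rate implied by 1 TB in 3 hours, taking 1 TB = 10**6 MB."""
    return 10**6 / (TB_READ_HOURS * 3600)

def expected_failures_per_day(machines):
    """Average number of machine failures per day in a cluster of this size."""
    return machines / MTTF_DAYS

print(f"{throughput_mb_per_s():.1f} MB/s")        # 92.6 MB/s
print(read_time_hours(1000) / 24)                 # 125.0 days on one machine
print(read_time_hours(1000, machines=1000))       # 3.0 hours on 1000 machines
print(expected_failures_per_day(1000))            # 1.0 failure per day
```

Under these assumptions, spreading the data over 1000 machines brings the read back down to about 3 hours, but the cluster should then expect roughly one machine failure per day.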
Allowed commands
In [2]:
!grep Guido data/week1/LICENSE.txt
In [3]:
!cat data/week1/LICENSE.txt data/week1/LICENSE.txt | wc -w
In [4]:
!cat data/week1/LICENSE.txt data/week1/LICENSE.txt | head
In [5]:
%%bash
for term in Python Guido Scala license
do
grep $term data/week1/LICENSE.txt | wc -l
done
In [6]:
%%bash
for ((num=0; num<=5; num++))
do
echo "I have $num cats"
done
In [7]:
%%bash
tail -n 115 data/week1/LICENSE.txt | head | cut -f 1-2 -d " "
In [8]:
%%bash
# -name takes a glob pattern; a bare "pdf" would only match files named exactly pdf
find /Users/BlueOwl1/Documents -name "*.pdf" | paste -s -d : -
In [13]:
!echo "scale=10; 4.32*(3/7)+1.23" | bc
In [41]:
%%bash
for num in {1..10}
do
# For modulo to work, scale must be 0
echo "(1+$num) % 3" | bc
done
In [95]:
%%bash
for num in {1..20}
do
if [ $((num % 15)) -eq 0 ]; then
echo fizzbuzz
elif [ $((num % 3)) -eq 0 ]; then
echo fizz
elif [ $((num % 5)) -eq 0 ]; then
echo buzz
echo buzz
else
echo $num
fi
done
In [105]:
%%bash
seq 15 | paste -sd+ -
The wait command pauses the shell until its background child processes (those started with &) have finished, so the next command runs only after they are done.
In [108]:
%%bash
seq 1000000 | wc &
echo "Finished waiting"
In [111]:
%%bash
seq 1000000 | wc &
wait; echo "Finished waiting"
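For comparison, the same start-then-wait pattern can be sketched in Python with the standard `subprocess` module. The child command here (a short sleep run through the current Python interpreter) is just a stand-in chosen for this example.

```python
import subprocess
import sys

# Popen returns immediately, like ending a bash command with &
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"])
print("Child started, parent keeps running")

# wait() blocks until the child exits, like bash's wait
return_code = child.wait()
print("Finished waiting, child exit code:", return_code)
```

Without the call to `wait()`, the second message could be printed while the child is still running, which is what the first bash cell above demonstrates.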
Parallel grep (my attempt)
In [172]:
%%bash
mkdir pgrep_temp_files
cd pgrep_temp_files
split -l 10 -a 5 ../week1/CountOfMonteCristo.txt pgrep_temp_files
for file in pgrep_temp_files*
do
grep "Python" "$file" &
done
# Wait for all background greps to finish before cleaning up
wait
# Return to original directory
cd ..
# Remove the temporary directory that was created
rm -fr pgrep_temp_files
In [166]:
%%timeit
!grep Python week1/CountOfMonteCristo.txt > /dev/null
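The split, search, and merge idea behind the parallel grep attempt above can also be sketched in Python. This is a rough illustration under stated assumptions: `grep_chunk` and `parallel_grep` are names invented here, matching is plain substring search rather than grep's regular expressions, and the input is an in-memory list of lines instead of the book file.

```python
from concurrent.futures import ThreadPoolExecutor

def grep_chunk(chunk, pattern):
    """Return the lines in one chunk that contain pattern."""
    return [line for line in chunk if pattern in line]

def parallel_grep(lines, pattern, chunk_size=10, workers=4):
    """Split lines into chunks, search the chunks concurrently, merge matches in input order."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so the merged output keeps the input order
        results = pool.map(grep_chunk, chunks, [pattern] * len(chunks))
        return [line for matches in results for line in matches]

# Hypothetical sample input: every seventh line mentions Python
sample = [f"line {i}: Python" if i % 7 == 0 else f"line {i}" for i in range(50)]
print(parallel_grep(sample, "Python"))
```

Threads keep the sketch short, but pure-Python string matching is CPU-bound, so a real speedup would more likely come from swapping in `ProcessPoolExecutor`. As with the bash version, very small chunks mean the scheduling overhead can outweigh any gain over a single sequential pass.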
MapReduce Basics - Chapter 2